A Parallel Scalable Infrastructure for OLAP and Data Mining
Abstract
Decision support systems are important in leveraging the information held in data warehouses in businesses such as banking, insurance, retail and health care, among many others. The multi-dimensional aspects of a business can be naturally expressed using a multi-dimensional data model. Data analysis and data mining on these warehouses pose new challenges for traditional database systems. OLAP and data mining operations require summary information on these multi-dimensional data sets. Query processing for these applications requires different views of the data for analysis and effective decision making. Data mining techniques can be applied in conjunction with OLAP for an integrated business solution. As data warehouses grow, parallel processing techniques have been applied to enable the use of larger data sets and to reduce the time for analysis, thereby enabling the evaluation of many more options for decision making. In this paper we address (1) scalability in multidimensional systems for OLAP and multi-dimensional analysis, (2) integration of data mining with the OLAP framework, and (3) high performance by using parallel processing for OLAP and data mining. We describe our system, PARSIMONY (Parallel and Scalable Infrastructure for Multidimensional Online analytical processing). This platform is used for both OLAP and data mining. Sparsity of data sets is handled by storing sparse chunks in a bit-encoded sparse structure for compression, which enables aggregate operations on compressed data. Techniques for effectively using the summary information available in data cubes are presented for association rule mining and decision-tree-based classification. These take advantage of the data organization provided by the multidimensional data model. Performance results for high-dimensional data sets on a distributed-memory parallel machine (IBM SP-2) show good speedup and scalability.
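The abstract's point that a bit-encoded sparse structure allows aggregation directly on compressed data can be illustrated with a minimal Python sketch. It is not the PARSIMONY implementation: the class `BitEncodedChunk`, its methods, and the bit layout (one bit field per dimension, packed into a single integer key) are assumptions made for illustration only. Only non-empty cells are stored, and summing over a dimension is done by masking that dimension's bit field in each key, so the chunk is never expanded into a dense array.

```python
from collections import defaultdict


def bits_needed(size):
    """Number of bits required to represent indices 0..size-1."""
    return max(1, (size - 1).bit_length())


class BitEncodedChunk:
    """Hypothetical sparse chunk: packed coordinate key -> measure value."""

    def __init__(self, dim_sizes):
        self.widths = [bits_needed(s) for s in dim_sizes]
        # Compute the bit offset of each dimension's field;
        # the last dimension occupies the lowest bits.
        self.shifts = []
        shift = 0
        for width in reversed(self.widths):
            self.shifts.append(shift)
            shift += width
        self.shifts.reverse()
        self.cells = {}

    def encode(self, coords):
        """Pack one index per dimension into a single integer key."""
        key = 0
        for index, shift in zip(coords, self.shifts):
            key |= index << shift
        return key

    def decode(self, key):
        """Unpack an integer key back into a coordinate tuple."""
        return tuple(
            (key >> shift) & ((1 << width) - 1)
            for shift, width in zip(self.shifts, self.widths)
        )

    def add(self, coords, value):
        """Store (or accumulate) a value for a non-empty cell."""
        key = self.encode(coords)
        self.cells[key] = self.cells.get(key, 0) + value

    def aggregate_out(self, dim):
        """Sum over dimension `dim` by zeroing its bit field in every key.

        Works directly on the packed keys; no dense array is built.
        """
        field = ((1 << self.widths[dim]) - 1) << self.shifts[dim]
        totals = defaultdict(int)
        for key, value in self.cells.items():
            totals[key & ~field] += value
        return dict(totals)


# Example: a 4 x 3 x 5 chunk with three non-empty cells.
chunk = BitEncodedChunk((4, 3, 5))
chunk.add((1, 0, 3), 5)
chunk.add((1, 2, 3), 7)
chunk.add((0, 1, 4), 2)

# Aggregate away the middle dimension; decoded keys carry 0 in that position.
summed = {chunk.decode(k): v for k, v in chunk.aggregate_out(1).items()}
```

In this sketch the two cells that differ only in the middle dimension collapse into one entry, which is the kind of group-by aggregation a data cube requires.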
Similar resources
PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining
Multidimensional analysis and online analytical processing (OLAP) operations require summary information on multidimensional data sets. Most common are aggregate operations along one or more dimensions of numerical data values. Simultaneous calculation of multidimensional aggregates is provided by the Data Cube operator, used to calculate and store summary information on a number of dimensions...
A Data-Warehouse/OLAP Framework for Scalable Telecommunication Tandem Traffic Analysis
In a telecommunication network, hundreds of millions of call detail records (CDRs) are generated daily. Applications such as tandem traffic analysis require the collection and mining of CDRs on a continuous basis. The data volumes and data flow rates pose serious scalability and performance challenges. This has motivated us to develop a scalable data-warehouse/OLAP framework, and based on this f...
High Performance Data Mining Using Data Cubes on Parallel Computers
On-Line Analytical Processing techniques are used for data analysis and decision support systems. The multidimensionality of the underlying data is well represented by multidimensional databases. For data mining in knowledge discovery, OLAP calculations can be effectively used. For these, high performance parallel systems are required to provide interactive analysis. Precomputed aggregate calcu...
Scalable real-time OLAP on cloud architectures
In contrast to queries for on-line transaction processing (OLTP) systems, which typically access only a small portion of a database, OLAP queries may need to aggregate large portions of a database, which often leads to performance issues. In this paper we introduce CR-OLAP, a scalable cloud-based real-time OLAP system built on a new distributed index structure for OLAP, the distributed PDCR tree. ...
Design and Implementation of a Scalable Parallel System for Multidimensional Analysis and OLAP
Multidimensional Analysis and On-Line Analytical Processing (OLAP) use summary information that requires aggregate operations along one or more dimensions of numerical data values. Query processing for these applications requires different views of data for decision support. The Data Cube operator provides multi-dimensional aggregates, used to calculate and store summary information on a number...